README for DICL Database

Description:
	The Domestic and International Common Language (DICL) Dataset contains a collection of country-level and bilateral
    measures of language connections. The 11 DICL indices reflect multiple dimensions of linguistic relationships between
    populations, including common official languages, common spoken languages, and intelligibility between different 
    languages. Several measures also differentiate between native languages and acquired languages.

Licensing:
	CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/).

Recommended citation:
	Gurevich, T., P.R. Herman, F. Toubal, and Y.V. Yotov. (2024) "The Domestic and International Common 
    Language (DICL) Database." USITC Economics Working Paper 2024-03-A.

Contents:
	The DICL database contains 15 columns and 58,564 rows of indices covering 29,403 unique country pairs. For
    convenience, the international records are mirrored so that there is a record for both the pair (i, j) and (j, i).
    The domestic records appear once for each country (2 * 29,161 international pairs + 242 domestic records
    = 58,564 total records). The 15 columns contain the 11 language measures listed under "Variables" below
    as well as the name and ISO 3-letter alpha identifier of each country. The first row contains the column labels.
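
    The record counts above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative
    only: it assumes that the 242 domestic records correspond to 242 countries and that every unordered pair of
    distinct countries is present. The function name is hypothetical and is not part of the database.

```python
def expected_record_counts(n_countries):
    """Record counts implied by n_countries under full pairwise coverage (an assumption)."""
    # Unordered pairs of distinct countries: n * (n - 1) / 2.
    international_pairs = n_countries * (n_countries - 1) // 2
    return {
        "international_pairs": international_pairs,
        # Unique pairs = international pairs plus one domestic pair (i, i) per country.
        "unique_pairs": international_pairs + n_countries,
        # International pairs are mirrored as (i, j) and (j, i); domestic records appear once.
        "total_records": 2 * international_pairs + n_countries,
    }


counts = expected_record_counts(242)
```

    With 242 countries, the three counts agree with the figures reported in this section.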

Variables:
	iso3_i:	Country i ISO 3-letter alpha identifier
	country_i:	Country i name
	iso3_j:	Country j ISO 3-letter alpha identifier
	country_j:	Country j name
	col:	Common official language indicator
	cor:	Restricted common official language indicator, based on a narrower definition of official language
	cnl:	Common native language index
	cal:	Common acquired language index
	csl:	Common spoken language index (native and acquired)
	lpn:	Linguistic proximity index for different native languages
	lpa:	Linguistic proximity index for different acquired languages
	lps:	Linguistic proximity index for different spoken languages (native and acquired)
	bpn:	Linguistic branch proximity index for different native languages
	bpa:	Linguistic branch proximity index for different acquired languages
	bps:	Linguistic branch proximity index for different spoken languages (native and acquired)


Technical definitions:	
    The official language indicators (col and cor) are defined based on the statuses of official languages in each country.
    col is defined such that col_ij = 1 if countries i and j share at least one official language, whether national or
    provincial, statutory or de facto. Otherwise, col_ij = 0. cor is defined such that cor_ij = 1 if countries i and j share
    at least one national statutory or de facto official language. Otherwise, cor_ij = 0. For the domestic indices (i=j),
    col and cor are set equal to 1 if the country has at least one official language, and to 0 otherwise.
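
    The rules for col and cor can be sketched as set intersections. This is a minimal illustration, not the
    procedure used to build the database. The input dictionaries (national, provincial), the country codes, and
    the language labels are all made up. The sketch also assumes that cor requires the shared language to be a
    national official language in both countries, which the text does not state explicitly.

```python
def official_language_indicators(i, j, national, provincial):
    """Return (col_ij, cor_ij) from hypothetical sets of official languages."""
    nat_i, nat_j = national.get(i, set()), national.get(j, set())
    all_i = nat_i | provincial.get(i, set())
    all_j = nat_j | provincial.get(j, set())
    if i == j:
        # Domestic rule from the text: both indicators equal 1 if the
        # country has at least one official language, and 0 otherwise.
        flag = int(bool(all_i))
        return flag, flag
    col = int(bool(all_i & all_j))  # any shared official language, national or provincial
    cor = int(bool(nat_i & nat_j))  # assumption: shared language is national in both
    return col, cor


# Made-up countries and languages, for illustration only.
national = {"AAA": {"x"}, "BBB": {"y"}, "CCC": set()}
provincial = {"AAA": set(), "BBB": {"x"}, "CCC": set()}

pair_ab = official_language_indicators("AAA", "BBB", national, provincial)
domestic_a = official_language_indicators("AAA", "AAA", national, provincial)
domestic_c = official_language_indicators("CCC", "CCC", national, provincial)
```

    In this toy example, AAA and BBB share language x, but x is only provincial in BBB, so col is 1 while cor is 0.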
    
    The common language indices (cnl, cal, and csl) are defined based on the population of speakers of each language in each
    country. Let l_ik denote the share of native speakers of language k in country i. cnl is defined as
    cnl_ij = sum_k (l_ik x l_jk) for all k. cal is defined as cal_ij = sum_k (a_ik x a_jk) for all k, where a_ik denotes
    the share of acquired-language speakers of language k in country i. Finally, csl is defined as
    csl_ij = sum_k [(l_ik + a_ik) x (l_jk + a_jk)] for all k.
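
    The three common language indices share one computation: a sum over languages of the product of the two
    countries' speaker shares. The sketch below assumes the shares are held in dictionaries keyed by language.
    The function names and the toy shares are hypothetical.

```python
def common_language_index(shares_i, shares_j):
    """Sum over languages k of (share of k in country i) x (share of k in country j)."""
    return sum(s_i * shares_j.get(k, 0.0) for k, s_i in shares_i.items())


def spoken_shares(native, acquired):
    """Per-language sum of native and acquired speaker shares."""
    languages = set(native) | set(acquired)
    return {k: native.get(k, 0.0) + acquired.get(k, 0.0) for k in languages}


# Made-up speaker shares for two countries i and j.
native_i, acquired_i = {"x": 0.8, "y": 0.2}, {"z": 0.1}
native_j, acquired_j = {"x": 0.5, "z": 0.5}, {"x": 0.2}

cnl = common_language_index(native_i, native_j)
cal = common_language_index(acquired_i, acquired_j)
csl = common_language_index(spoken_shares(native_i, acquired_i),
                            spoken_shares(native_j, acquired_j))
```

    Because csl multiplies the summed shares, it also picks up cross terms between native speakers in one
    country and acquired speakers in the other, so it is generally not equal to cnl + cal.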
    
    The linguistic proximity indices (lpn, lpa, and lps) are defined based on the structure of the linguistic family
    trees of the languages involved. Let b_k denote the number of branches in the linguistic family tree for language k.
    Let b_kh denote the number of branches that are common to two languages k and h. The linguistic proximity between
    these languages is given by P_kh = b_kh / [0.5 x (b_k + b_h)], reflecting the average proportion of the two language
    trees that they share. Using the notation defined above, the lpn index is defined as
    lpn_ij = sum_k sum_{h!=k} (l_ik x l_jh x P_kh) for all k and h. Notably, cases where k=h are excluded so that the
    measure does not include same-language pairs. The lpa index is similarly defined as
    lpa_ij = sum_k sum_{h!=k} (a_ik x a_jh x P_kh) for all k and h. Finally, the lps index is defined as
    lps_ij = sum_k sum_{h!=k} [(l_ik + a_ik) x (l_jh + a_jh) x P_kh] for all k and h.
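
    A minimal sketch of the linguistic proximity calculation follows. It assumes each language's tree is stored as
    a tuple of branch labels from the root down, so that b_k is the tuple length and b_kh is the length of the
    common prefix. This representation, the function names, and the toy data are assumptions made for illustration.

```python
def shared_branches(path_k, path_h):
    """b_kh: number of leading branches that two tree paths have in common."""
    count = 0
    for branch_k, branch_h in zip(path_k, path_h):
        if branch_k != branch_h:
            break
        count += 1
    return count


def proximity(path_k, path_h):
    """P_kh = b_kh / [0.5 x (b_k + b_h)]."""
    return shared_branches(path_k, path_h) / (0.5 * (len(path_k) + len(path_h)))


def proximity_index(shares_i, shares_j, paths):
    """Double sum over languages k in i and h in j, skipping same-language pairs."""
    return sum(s_k * s_h * proximity(paths[k], paths[h])
               for k, s_k in shares_i.items()
               for h, s_h in shares_j.items()
               if h != k)


# Made-up tree paths and native-speaker shares.
paths = {"x": ("F1", "a", "b", "c"), "y": ("F1", "a", "d"), "z": ("F2", "e")}
native_i = {"x": 0.8, "y": 0.2}
native_j = {"x": 0.5, "z": 0.5}

lpn = proximity_index(native_i, native_j, paths)
```

    Passing acquired shares, or the per-language sum of native and acquired shares, to the same function gives
    the lpa and lps analogues. In the toy data only the (y, x) pair contributes, because z belongs to a different tree.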
    
    The branch proximity indices (bpn, bpa, and bps) are defined using the same linguistic tree information as the
    linguistic proximity indices. Letting M denote the maximum tree length across all languages (M = max(b_k)), define
    branch proximity as R_kh = (M - [0.5 x (b_k + b_h) - b_kh]) / M. If two languages are from different language trees
    and share no common branches, R_kh = 0. Using branch proximity, the bpn index is defined as
    bpn_ij = sum_k sum_{h!=k} (l_ik x l_jh x R_kh) for all k and h. The bpa index is similarly defined as
    bpa_ij = sum_k sum_{h!=k} (a_ik x a_jh x R_kh) for all k and h. Finally, the bps index is defined as
    bps_ij = sum_k sum_{h!=k} [(l_ik + a_ik) x (l_jh + a_jh) x R_kh] for all k and h.
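
    Branch proximity can be sketched in the same way. As an illustration only, the code below assumes each
    language's tree is stored as a tuple of branch labels from the root down, and it takes M as the longest path
    among the languages supplied. The function names and the toy data are hypothetical.

```python
def shared_branches(path_k, path_h):
    """b_kh: number of leading branches that two tree paths have in common."""
    count = 0
    for branch_k, branch_h in zip(path_k, path_h):
        if branch_k != branch_h:
            break
        count += 1
    return count


def branch_proximity(path_k, path_h, max_len):
    """R_kh = (M - [0.5 x (b_k + b_h) - b_kh]) / M, or 0 when no branch is shared."""
    common = shared_branches(path_k, path_h)
    if common == 0:
        return 0.0  # languages from different trees
    return (max_len - (0.5 * (len(path_k) + len(path_h)) - common)) / max_len


def branch_proximity_index(shares_i, shares_j, paths):
    """Double sum over languages k in i and h in j, skipping same-language pairs."""
    max_len = max(len(path) for path in paths.values())  # M in the text
    return sum(s_k * s_h * branch_proximity(paths[k], paths[h], max_len)
               for k, s_k in shares_i.items()
               for h, s_h in shares_j.items()
               if h != k)


# Made-up tree paths and native-speaker shares.
paths = {"x": ("F1", "a", "b", "c"), "y": ("F1", "a", "d"), "z": ("F2", "e")}
native_i = {"x": 0.8, "y": 0.2}
native_j = {"x": 0.5, "z": 0.5}

bpn = branch_proximity_index(native_i, native_j, paths)
```

    The explicit zero for unrelated languages matters: without it, the formula would assign x and z a positive
    score even though they share no branches.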


